suppressPackageStartupMessages(library(tidyverse))

3.2.4 Exercises

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

The result is an empty grey panel: there are no aesthetics or geom layers, so ggplot does not display any data.

  1. How many rows are in mtcars? How many columns?
# Rows
nrow(mtcars)
## [1] 32
# Columns
ncol(mtcars)
## [1] 11
# Both
dim(mtcars)
## [1] 32 11
  1. What does the drv variable describe? Read the help for ?mpg to find out.

The type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.

  1. Make a scatterplot of hwy vs cyl.
ggplot(mpg, aes(x = hwy, y = cyl)) + 
  geom_point()

  1. What happens if you make a scatterplot of class vs drv. Why is the plot not useful?
ggplot(mpg, aes(x = class, y = drv)) + 
  geom_point()

Both variables are categorical, so many observations share the same position. The points overlap, and the plot does not reveal which positions hold more than one observation or how many.
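As a rough sketch of how the hidden overlap could be made visible, the counts behind each position can be tabulated and then drawn with geom_count(). The object name `overlap` is an illustrative assumption, not part of the original answer.

```r
library(ggplot2)

# Number of observations behind each class/drv position (hypothetical helper table)
overlap <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
overlap <- overlap[overlap$Freq > 0, ]
overlap

# geom_count() sizes each point by the number of observations it covers
ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()
```

The table has far fewer rows than mpg, which is why the plain scatterplot shows so few points.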

3.3.1 Exercises

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

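The colour is specified inside aes(), so ggplot2 treats “blue” as a data value (a constant variable) and maps it to the default colour scale instead of using it as a colour name. To set the colour manually, it has to be placed outside aes():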

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
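One way to confirm the difference, sketched here with the illustrative names `p_mapped` and `p_set`, is to inspect the colour that each layer actually computes with layer_data().

```r
library(ggplot2)

# Colour mapped inside aes(): "blue" is treated as data
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Colour set outside aes(): "blue" is used as a colour name
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the computed data for a layer, including its colour
mapped_colour <- unique(layer_data(p_mapped)$colour)
set_colour <- unique(layer_data(p_set)$colour)

mapped_colour  # a colour chosen by the default discrete scale
set_colour     # the colour that was set by hand
```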

  1. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
mpg
## # A tibble: 234 x 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

The data in R does not store any variable as a factor. This can be seen by looking at the variable types displayed under the variable names. None of them is marked as <fctr>.

However, when reading the documentation it becomes clear that at least the following variables should be categorical: cyl, drv, fl and class.
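The column types can also be listed directly, and the variables named above can be converted to factors in a copy of the data. This is only a sketch: `mpg_fct` and `cat_cols` are hypothetical names introduced for illustration.

```r
library(ggplot2)

# Storage type of every column in mpg
col_types <- sapply(mpg, class)
col_types

# Hypothetical copy with the four categorical variables converted to factors
mpg_fct <- as.data.frame(mpg)
cat_cols <- c("cyl", "drv", "fl", "class")
mpg_fct[cat_cols] <- lapply(mpg_fct[cat_cols], factor)

# Number of distinct levels per converted variable
sapply(mpg_fct[cat_cols], nlevels)
```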

  1. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

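For a continuous variable, colour is drawn as a gradient and size scales the points, so both indicate how the values are spread across the displayed points on the plot. Shape can be applied only to discrete variables, so it is not an option: mapping a continuous variable to shape produces an error.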
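A minimal sketch of the shape case, assuming a current ggplot2 installation: the error appears when the plot is built, so the example captures it with tryCatch() instead of stopping. The name `shape_result` is illustrative.

```r
library(ggplot2)

# Build a plot that maps the continuous variable cty to shape and
# capture the resulting error message instead of aborting
shape_result <- tryCatch(
  {
    p_shape <- ggplot(data = mpg) +
      geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
    ggplot_build(p_shape)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
shape_result
```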

  1. What happens if you map the same variable to multiple aesthetics?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = cty, size = cty))

All of the aesthetics are applied at the same time, which is often redundant because they repeat the same information.

  1. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), shape = 21)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), shape = 21, stroke = 1)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), shape = 21, stroke = 5)

stroke modifies the width of the border for shapes that have a border, such as shapes 21 to 25.

  1. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, colour = displ < 5))

The expression is evaluated for every row, and the aesthetic is applied to the result (here TRUE or FALSE) in the same way as if it were a single variable.

3.5.1 Exercises

  1. What happens if you facet on a continuous variable?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl)) +
  facet_wrap( ~ displ)

Not a good idea! R makes a separate panel for each unique value of the variable.

  1. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl)) +
  facet_grid(drv ~ cyl)

There are combinations that are not represented in the data set. For example, no observation has drv = r and cyl = 4. These are the same positions that have no points in the scatterplot of drv against cyl.
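The missing combinations can be listed with a cross-tabulation. This is a small sketch; `drv_cyl` is an illustrative name.

```r
library(ggplot2)

# Cross-tabulate drive train against number of cylinders
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
drv_cyl

# Combinations with a count of zero correspond to the empty facets
which(drv_cyl == 0, arr.ind = TRUE)
```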

  1. What plots does the following code make? What does . do?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

The first plot facets by drv in rows: each level of drv gets its own row of panels. The second plot facets by cyl in columns: each level of cyl gets its own column. The . is a placeholder that means no faceting in that dimension.

  1. Take the first faceted plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic?

What are the disadvantages? How might the balance change if you had a larger dataset?

The data is visually separated, so it may be easier to analyse the cases that belong to a single value of the variable.

On the other hand, this way of plotting can make it more difficult to compare values across facets.

With a lot of data the panels become fuller, and in combination with many facets the plot can prove difficult to read. However, dividing a large data set into smaller groups might reveal patterns that are difficult to see on a single plot. Depending on the content, there can be arguments for or against faceting a big data set, just as there are for a small one.

  1. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol variables?
  1. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

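Plots are usually wider than they are tall, so putting the variable with more levels in the columns leaves more room for each panel. If a variable has only one level, the plot will be the same as if it were not faceted.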

3.6.1 Exercises

  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?
# Line chart
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_line()

# Boxplot
ggplot(data = mpg, mapping = aes(x = factor(cyl), y = hwy)) +
  geom_boxplot()

# Histogram
ggplot(data = mpg, mapping = aes(x = displ)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Area chart
ggplot(data = mpg, mapping = aes(x = displ)) +
  geom_area(stat = "bin")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  1. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

A scatter plot with displ on the x-axis and hwy on the y-axis.

The points are coloured by drv level.

One smooth line per drv level is drawn without a confidence interval, in the same colour as the points of that level.

  1. What does show.legend = FALSE do? What happens if you remove it?
    Why do you think I used it earlier in the chapter?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(show.legend = FALSE)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(show.legend = TRUE)

This parameter determines whether the layer is included in the legend. The default is NA, which shows a legend whenever the layer maps an aesthetic, so show.legend = FALSE only has to be set when the legend is not wanted. A legend is usually very useful, but in some cases it is better to omit it, for example when the plot is too busy and you need the space, or when you plot multiple layers and the legend would be repeated.

  1. What does the se argument to geom_smooth() do?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess'

It controls whether a confidence interval is drawn around the smooth line(s). The default is se = TRUE.

  1. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

They will look the same because they use exactly the same data and aesthetics. The only difference is that the first version declares the data and mapping once in ggplot(), where both layers inherit them, while the second version repeats the same data and mapping in each layer.

  1. Recreate the R code necessary to generate the following graphs.

3.7.1 Exercises

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
# Original plot
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

# Rewritten plot
ggplot(data = diamonds) + 
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )  

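The default geom for stat_summary() is geom_pointrange(). stat_summary() is a statistical transformation that summarises the y values at each x, here as the minimum, median and maximum.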
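In ggplot2 3.3.0 and later, the fun.y, fun.ymin and fun.ymax arguments are deprecated in favour of fun, fun.min and fun.max. The sketch below assumes such a version is installed; `p_summary` and `summary_rows` are illustrative names.

```r
library(ggplot2)

# Same summary, written with the argument names used by ggplot2 >= 3.3.0
p_summary <- ggplot(data = diamonds) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.min = min,
    fun.max = max,
    fun = median
  )
p_summary

# One summarised row (ymin, y, ymax) per level of cut
summary_rows <- layer_data(p_summary)
summary_rows[, c("x", "ymin", "y", "ymax")]
```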

  1. What does geom_col() do? How is it different to geom_bar()?
ggplot(diamonds) +
  geom_bar(aes(x = cut))

ggplot(diamonds) +
  geom_bar(aes(x = cut, weight = price))

ggplot(diamonds) +
  geom_col(aes(x = cut, y = price))

geom_bar creates bar charts that show counts (or sums of weights). It uses stat_count to compute the statistics.

geom_col is used when the heights of the bars represent actual values in the data. It uses stat_identity, i.e. it leaves the data as it is.
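To illustrate the relationship, a hedged sketch: if the counts are computed beforehand, geom_col() draws the same bars that geom_bar() computes on its own. The names `cut_counts`, `p_bar` and `p_col` are assumptions made for this example.

```r
library(ggplot2)

# Pre-compute the counts that geom_bar() would calculate itself
cut_counts <- as.data.frame(table(cut = diamonds$cut))

p_bar <- ggplot(diamonds) + geom_bar(aes(x = cut))
p_col <- ggplot(cut_counts) + geom_col(aes(x = cut, y = Freq))

# Compare the bar heights computed for each layer
bar_heights <- as.numeric(layer_data(p_bar)$y)
col_heights <- as.numeric(layer_data(p_col)$y)
all.equal(bar_heights, col_heights)
```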

  1. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

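Each pair represents the same statistical transformation: the geom uses the stat by default, and the stat uses the geom by default. Some examples are geom_bar() and stat_count(), geom_boxplot() and stat_boxplot(), and geom_smooth() and stat_smooth().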

  1. What variables does stat_smooth() compute? What parameters control its behaviour?
  1. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

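In these plots every bar has the same height: proportions are computed within each group, and by default each level of cut forms its own group, so every proportion is 1. The bars need to show the number of occurrences per level as a proportion of the total number of occurrences. Setting group = 1 puts all rows in one group; with a fill variable, position = "fill" shows the proportions within each bar: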

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color), position = "fill")
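The proportions drawn with group = 1 can be checked against a calculation by hand. This is a sketch with illustrative names (`cut_props`, `bar_props`); it keeps the ..prop.. notation used above, which newer ggplot2 versions accept with a deprecation warning.

```r
library(ggplot2)

# Proportion of diamonds in each cut, relative to the whole data set
cut_props <- as.numeric(prop.table(table(diamonds$cut)))
round(cut_props, 3)

# Bar heights computed by geom_bar() when all rows form a single group
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
bar_props <- as.numeric(layer_data(p_prop)$y)

all.equal(cut_props, bar_props)
```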

3.8.1 Exercises

  1. What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

Many points overlap each other, so the plot hides how many observations there are at each position. That can be avoided by using position = "jitter" or geom_jitter:

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

  1. What parameters to geom_jitter() control the amount of jittering?

width and height

# Default
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

# Width
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(width = 0.01)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(width = 0.5)

# Height
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(height = 0.1)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(height = 20)

# Width and height
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(width = 0.1, height = 20)
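As a rough check of what the two parameters do, the jittered coordinates can be compared with the original data. The sketch assumes a ggplot2 version in which position_jitter() accepts a seed argument; `p_jitter` and `jittered` are illustrative names.

```r
library(ggplot2)

# Jitter horizontally by at most 0.5 units and not at all vertically;
# the seed makes the random noise reproducible
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.5, height = 0, seed = 1))
p_jitter

# Compare the plotted coordinates with the original values
jittered <- layer_data(p_jitter)
max(abs(jittered$x - mpg$cty))  # largest horizontal shift
all(jittered$y == mpg$hwy)      # vertical positions are unchanged
```

The noise is added in both directions, so the total spread is twice the value given for width or height.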

  1. Compare and contrast geom_jitter() with geom_count().
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

geom_jitter() gives insight into the real distribution by adding random noise to each point so that fewer points overlap, while geom_count() shows the density of overlapping points by increasing the size of the point. geom_point() simply places points on top of each other, so it is not visible where points are covered.
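The sizes used by geom_count() come from a computed variable, n, which can be inspected with layer_data(). A minimal sketch, with `p_count` and `counted` as illustrative names:

```r
library(ggplot2)

p_count <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_count()

# One row per distinct (cty, hwy) position; n is the number of
# observations that geom_point() would draw on top of each other
counted <- layer_data(p_count)
head(counted[order(-counted$n), c("x", "y", "n")])

nrow(counted)  # distinct positions
nrow(mpg)      # observations
```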

  1. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
# Default (dodge)
ggplot(mpg, aes(x = factor(cyl), y = hwy, colour = factor(drv))) +
  geom_boxplot()

# Other
ggplot(mpg, aes(x = factor(cyl), y = hwy, colour = factor(drv))) +
  geom_boxplot(position = "dodge")

ggplot(mpg, aes(x = factor(cyl), y = hwy, colour = factor(drv))) +
  geom_boxplot(position = "jitter")

ggplot(mpg, aes(x = factor(cyl), y = hwy, colour = factor(drv))) +
  geom_boxplot(position = "nudge")

The default position is dodge (recent versions of ggplot2 use the closely related dodge2). Not all of the positions make sense for this type of plot.
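The default can also be read from the layer object itself. The exact class name depends on the installed ggplot2 version, so no output is shown here; `default_position` is an illustrative name.

```r
library(ggplot2)

# A layer stores its position adjustment; the first class name identifies it
default_position <- class(geom_boxplot()$position)[1]
default_position
```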

3.9.1 Exercises

  1. Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(mpg, aes(x = fl, fill = drv)) +
  geom_bar(position = "stack")

ggplot(mpg, aes(x = fl, fill = drv)) +
  geom_bar() +
  coord_polar()
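The plot above draws one ring segment per fuel type instead of a single pie. A common way to get a conventional pie chart, sketched here as an alternative and not as the original answer, is to stack everything into one bar and map the bar height to the angle with theta = "y". The names `p_stacked` and `p_pie` are illustrative.

```r
library(ggplot2)

# A single stacked bar: every observation shares the same x position
p_stacked <- ggplot(mpg, aes(x = factor(1), fill = drv)) +
  geom_bar(width = 1)
p_stacked

# Mapping the bar height (y) to the angle turns the stack into pie slices
p_pie <- p_stacked +
  coord_polar(theta = "y")
p_pie
```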

  1. What does labs() do? Read the documentation.

labs changes the default text on the plot, such as axis labels, legend titles, the title and the caption:

# Default
ggplot(mtcars, aes(mpg, wt, colour = cyl)) + 
  geom_point()

# Changed
ggplot(mtcars, aes(mpg, wt, colour = cyl)) + 
  geom_point() + 
  labs(colour = "Cylinders") + 
  labs(x = "Miles per gallon", y = "Weight") + 
  labs(caption = "(based on data from mtcars data set)") +
  labs(title = "Miles per gallon vs. weight", subtitle = "Scatter plot")

  1. What’s the difference between coord_quickmap() and coord_map()?
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
library(mapproj)
nz <- map_data("nz")
# Prepare a map of NZ
nzmap <- ggplot(nz, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
# Plot it in cartesian coordinates
nzmap + labs(caption = "Default")

# With the Mercator projection (the default of coord_map)
nzmap + coord_map() + labs(caption = "coord_map")

# With the aspect ratio approximation
nzmap + coord_quickmap() + labs(caption = "coord_quickmap")

coord_map projects the map (using the Mercator projection by default) so that it is displayed in realistic proportions.

coord_quickmap only approximates the projection by fixing the aspect ratio. It needs fewer calculations, so it is quicker to display but less precise.

  1. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

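The points lie above the reference line, so cars get more miles per gallon on the highway than in the city.

coord_fixed does not stretch the x or y axis to fill the available space. It keeps the ratio between the units on the two axes fixed (1:1 by default), so the reference line is drawn at 45 degrees.

geom_abline draws a line. If no parameters are provided, the line is y = x, but it can be changed by specifying slope and intercept (the y value when x = 0).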

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed(ratio = 1/5)
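The relationship that the reference line reveals can also be checked numerically. This is only a sketch; `share_above` is an illustrative name.

```r
library(ggplot2)

# Share of cars whose highway mileage exceeds their city mileage,
# i.e. the share of points above the line y = x
share_above <- mean(mpg$hwy > mpg$cty)
share_above

# Distribution of the gap between highway and city mileage
summary(mpg$hwy - mpg$cty)
```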